Abstract: We consider the problem of inferring the topology
of an M-by-N network by sending probes between M sources
and N receivers. Prior work has shown that this problem can
be decomposed into two parts: first, infer smaller subnetwork
components (i.e., 1-by-N's or 2-by-2's) and then merge these
components to identify the M-by-N topology. In this paper, we
focus on the second part. In particular, we assume that a 1-by-N
topology is given and that all 2-by-2 components can be queried
and learned using end-to-end probes. The problem is which
2-by-2's to query and how to merge them with the 1-by-N, so as
to exactly identify the 2-by-N topology, and optimize a number
of performance metrics including measurement traffic, time
complexity, and memory usage. We provide a lower bound,
$\lceil N/2 \rceil$, on the number of 2-by-2's required by any active
learning algorithm and propose two greedy algorithms. The first
algorithm follows a bottom-up approach: at every step, it selects
two receivers, queries the corresponding 2-by-2, and merges it with
the given 1-by-N; it requires exactly N - 1 steps, which is much
less than all $\binom{N}{2}$ possible 2-by-2's. The second algorithm
follows the framework of multiple hypothesis testing, in particular
Generalized Binary Search (GBS). Simulation results over synthetic
and realistic topologies demonstrate that both algorithms correctly
identify the 2-by-N topology and are near-optimal, but the bottom-up
approach is more efficient in practice.

I. Introduction 

Knowledge of network topology is important for network 
management, diagnosis, operation, security, and performance 
optimization l^M- In this paper, we consider a tomographic 
approach to topology inference, which assumes no cooperation 
from intermediate nodes and relies on end-to-end probes 
to infer internal network characteristics, including topology 
[10]. Typically, multicast or unicast probes are sent/received
between sets of sources/receivers at the edge of the network, 
and the topology is inferred based on the number and order 
of received probes, or more generally, using some metric 
or correlation structure. An important performance metric is 
measurement bandwidth overhead: it is desirable to accurately 
infer the topology using a small number of probes. 

In this paper, we focus on the problem of multiple-source
multiple-destination topology inference: our goal is to infer
the internal network (M-by-N) topology by sending probes
between M sources and N receivers at the edge of the
network. Prior work has shown that this problem can
be decomposed into two parts: first, infer smaller subnetwork
components (e.g., multiple 1-by-N's or 2-by-2's) and then
merge them to identify the entire M-by-N topology.

Significant progress has been made over the past years on
the decomposition and the first part of the problem, i.e.,
inferring smaller components (1-by-N's or 2-by-2's) using active
probes. One body of work developed techniques for inferring
1-by-N (i.e., single-source tree) topologies using end-to-end
measurements IITl-flSll. Follow-up work showed that an
M-by-N topology can be decomposed into/reconstructed from
a number of two-source, two-receiver (2-by-2) subnetwork
components or "quartets". In (H Hll, a practical scheme was
proposed to distinguish between some quartet topologies using
back-to-back unicast probes. In our recent work JlBl [ItI, we
proposed a method to exactly identify the topology of a quartet
in networks with multicast and network coding capabilities.

In this paper, we focus on the second part of the problem, 
namely selecting and merging smaller subnetwork components 
to exactly identify the M-by-N topology, which has received
significantly less attention than the first part. Existing approaches
developed for merging the quartets JH HI have several
limitations, including not being able to exactly identify the M-by-N
topology and/or being inefficient (e.g., requiring probes to be sent
over all $\binom{N}{2}$ possible quartets). In this paper, we formulate the
problem as active learning, characterize its complexity, and fol- 
low principled approaches for designing efficient algorithms to 
solve it. This complexity is important from both theoretical (a 
fundamental property of the topology inference problem) and 
practical (it determines the measurement bandwidth overhead, 
running time and memory usage) points of view. These costs 
can become particularly important when we need to infer large 
or dynamic topologies using active measurements. 

More specifically, we start from the problem of 2-by-N
topology inference, which is an important special case and
can then be used as a building block for inferring an M-by-N.
Consistently with [1], we assume that a (static) 1-by-N topology
is known (e.g., using one of the methods in ll4l7l 4l5l fl8n)
and that the topology of a quartet component can be queried
and learned, if so desired (e.g., using end-to-end probes and
some of the methods in CI II IS O [BSEl). The problem
then becomes one of active learning: which quartets to query
and how to merge them with the given 1-by-N, so as to exactly
identify the 2-by-N and optimize several performance metrics
including measurement bandwidth, merging complexity and
memory usage. Our contributions are as follows:

1) We provide a lower bound of $\lceil N/2 \rceil$ on the number of
quartets required by any active learning algorithm in order to
identify the 2-by-N. This characterizes the inherent complexity
of the problem and also serves as a rough baseline for assessing
the performance of practical algorithms.

^ Other techniques may also be developed in the future: this is still an active
research area. But this is out of the scope of this paper (see Section III).

2) We design an efficient merging algorithm that follows a
greedy bottom-up approach and provably identifies the 2-by-N
by querying exactly N - 1 quartets. From the active probing
perspective, this is attractive since only N - 1 queries are
required, which is much lower than all $\binom{N}{2}$ possible queries.

3) We also formulate the problem within the framework 
of multiple hypothesis testing and develop an active learning 
algorithm based on Generalized Binary Search (GBS). 

We compare the two algorithms to each other and to 
the lower bound via simulations over synthetic and realistic 
topologies. The results show that both algorithms can exactly 
identify the topology and are near-optimal in terms of active 
measurement bandwidth. Between the two, the bottom-up al- 
gorithm is very efficient in terms of running time and memory 
usage, and thus recommended for practical implementation. 

The rest of the paper is organized as follows. Section II
summarizes related work. Section III provides the problem
statement and terminology. Section IV provides a lower bound
on the number of quartets required by any algorithm. Section V
proposes an efficient bottom-up algorithm and analyzes its
correctness and performance. Section VI proposes another
greedy algorithm based on the GBS framework. Section VII
evaluates the two algorithms through simulations. Section VIII
discusses possible extensions. Section IX concludes the paper.

II. Related Work 

There is a large body of prior work on inference of network
topology. The works most closely related to this paper are those
using active measurements and network tomography.

One family of techniques relies on cooperation of nodes
in the middle of the network, and uses traceroute [20]-[23]
measurements to collect the ids of nodes along paths. However,
some nodes may not respond and nodes often have multiple
network interfaces (ids). Thus, traceroute-based methods
must deal with missing/incomplete data and alias problems.

Unlike traceroute, tomographic approaches do not rely 
on responses from intermediate nodes, but only on end-to-end 
measurements. A survey of network tomography can be found 
in [10]. Most tomographic approaches rely on probes sent from
a single source in a tree topology [ItUisI] and feed the number, 
order, or a monotonic property of received probes as input to 
statistical signal-processing techniques. 

In i^^, the authors formulated the multiple source multiple 
destination (M-by-N) tomography problem by sending probes
between M sources and N receivers. It was shown that an
M-by-N network can be decomposed into a collection of 2-by-2
components, also referred to as quartets JH Coordinated 
transmission of back-to-back unicast probes from 2 sources 
and packet arrival order measurements at the 2 receivers were 
used to infer some information about the quartet topology.
Assuming knowledge of M 1-by-N topologies and the quartets,
it was also shown how to merge a second source's 1-by-N
tree with the first one. The resulting M-by-N is not exact, but
bounds were provided on the locations of the points where the
two 1-by-N trees merge with each other. This approach also
requires a large number of probes for statistical significance, 
similar to many other methods [ItI-ITTI]. Compared to lU], our 
work is different in that (i) we assume perfect knowledge of the
quartets, thus we identify the topology accurately; (ii) we focus
on the efficiency of active learning, i.e., selecting and merging
the quartets, which has not been studied before. To the best of
our knowledge, the only other merging algorithm proposed in
the literature is (H Hh. However, the merging was not efficient
since all possible quartets were queried exhaustively.

Fig. 1. An example 2-by-4 topology. Solid lines and branching points $B_{i,j}$
depict $G_{S_1 \times \mathcal{R}}$. $J_i$ is a joining point, where $P_{2i}$ (shown by dashed
lines) joins $G_{S_1 \times \mathcal{R}}$. An example quartet is the part of the network
connecting $S_1, S_2$ to $R_1, R_2$, which is type 1 since both $J_1, J_2$ lie above
the branching point of $R_1, R_2$ in $G_{S_1 \times \mathcal{R}}$, i.e., $B_{1,2}$.

In our prior work JIH [ItI], we revisited the problem of 
topology inference using end-to-end probes in networks where 
internal nodes are equipped with multicast and network coding 
capabilities. We built on [HI] and extended it, using network 
coding at internal nodes to deterministically distinguish among 
all possible quartet topologies, which was not possible before. 
While in [TtIi, we focused on inferring quartets fast and 
accurately, here we assume that any quartet can be queried and 
learned, and focus on efficiently selecting and merging quartets 
to infer the larger topology. To the best of our knowledge, we 
are the first to look at this aspect of the problem. 

There also exists a rich body of work on multiple hypothesis 
testing. One of the contributions of this paper is to formulate 
this problem in that framework and design an algorithm based 
on GBS Ill-Ill], which we describe in detail in Section VI.

Topology inference problems have also been studied in the 
context of phylogenetic trees iUtI IH]. iH] built on [28] and 
proposed robust algorithms for multiple source tree topology 
inference. Jsl] inferred the topology of sparse random graphs. 
However, the quartet structures and the way we measure them 
are different in our case due to the nature of active probing in 
network tomography (see problem formulation in Section III).

III. Problem Statement 

M-by-N Topology to be inferred. Consider an M-by-N
topology as a directed acyclic graph (DAG), between M source
nodes $\mathcal{S} = \{S_1, \dots, S_M\}$ and N receivers $\mathcal{R} = \{R_1, \dots, R_N\}$.
We denote this M-by-N topology by $G_{\mathcal{S} \times \mathcal{R}}$. Note that
$G_{S_i \times \mathcal{R}}$, $i = 1, \dots, M$, is a 1-by-N tree. Similar to prior work,
we assume that a predetermined routing policy maps each
source-destination pair to a unique route from the source to the
destination. This implies the following three properties, first
stated in [1]:

^These assumptions are realistic, the same as in (B-Ell? and consistent with 
the destination-based routing used in the Internet: each router decides the next 
hop taken by a packet using a routing table lookup on the destination address. 
We further assume that the network does not employ load balancing. 



Fig. 2. The four possible types of a quartet (2-by-2 subnetwork component): (a) type 1, (b) type 2, (c) type 3, (d) type 4. There are two sources $S_1, S_2$ multicasting packets $x_1, x_2$ to two receivers
$R_1, R_2$. All links are directed downwards, but arrowheads are omitted to avoid cluttering. (The 1-by-2 topology of $S_1$ is a tree composed of $S_1$, $B_{1,2}$, $R_1$, $R_2$.
Similarly, the 1-by-2 tree rooted at $S_2$ is composed of $S_2$, its own branching point, $R_1$, $R_2$. $J_1$ and $J_2$ are joining points, where paths from $S_2$ to $R_1$ and $R_2$ join/merge with $S_1$'s tree.)



A1 For every source $S_i$ and every receiver $R_j$, there is a
   unique path $P_{ij}$.
A2 Two paths $P_{ij}$ and $P_{ik}$, $j \neq k$, branch at a branching
   point B, and they never merge again.
A3 Two paths $P_{ik}$ and $P_{jk}$, $i \neq j$, merge at a joining point
   J, and they never split again.

We are interested in inferring the logical topology, defined
by the branching and joining points described above. We present
most of our discussion in terms of M = 2, i.e., inferring a
2-by-N topology $G_{\mathcal{S} \times \mathcal{R}}$, $\mathcal{S} = \{S_1, S_2\}$; an M-by-N topology,
$\mathcal{S} = \{S_1, \dots, S_M\}$, can then be constructed by merging smaller
structures, as we describe in Section VIII.

Example 1: Fig. 1 illustrates an example 2-by-N topology
with N = 4. The logical tree topology of $S_1$ is shown by solid
lines and branching points $B_{i,j}$. Each $J_i$ depicts a joining
point, where the path from $S_2$ to receiver $R_i$ (indicated by the
dashed lines) joins the $S_1$ tree. For example, the path from
$S_2$ to $R_1$ joins the $S_1$ tree at a point between $B_{1,3}$ and $B_{1,2}$,
whereas the path to $R_4$ joins at a point above $B_{1,4}$. ■

Quartet Components. In [1], it has been shown that an
M-by-N topology can be decomposed into a collection of 2-by-2
subnetwork components, which, in this paper, we call quartets,
following the terminology in JH @]. Each quartet can be of four
possible types, as shown in Fig. 2. We refer to Fig. 2(a), (b),
(c), and (d) as types 1, 2, 3, and 4, respectively.

In order to infer the type of a quartet between two sources
$S_1, S_2$ and two receivers $R_i, R_j$, a set of probes must be sent
from $S_1, S_2$ to $R_i, R_j$. The received probes can then be
processed using techniques such as the ones developed in: IH 0]
(which distinguish type 1 from types 2, 3, 4 by sending
back-to-back unicast probes); JT^ (which distinguish among
all four types exploiting multicast and network coding); [19]
(which can exactly infer the topology of a super-source to two
receivers using network coding); traceroute [20]-[23] from
the two sources to the two receivers; or other techniques that
may be developed in the future, since this is still an active
research area. We consider the design of these techniques to
be out of the scope of this paper and we focus on their use
by active learning algorithms to perform a query, i.e., learn a
quartet type by sending and processing a set of active probes.

^ A logical topology is obtained from a physical topology by ignoring nodes
with in-degree = out-degree = 1. Such nodes cannot be identified and network
tomography always focuses on inferring logical topologies.

Being able to query the type of a quartet enables inference
of an M-by-N topology in two steps, as follows: first infer the
type of each quartet, and then merge these quartets to identify
the original topology. Indeed, knowing the type of the quartet,
we can use Fig. 2 to infer the relative location of joining and
branching points. For example, knowing that the quartet is of
type 1 implies that (i) the two joining points coincide, $J_1 = J_2$;
(ii) the two branching points (one in each source's tree) coincide;
and (iii) the joining point is above the branching point. Similar
inferences can be made from the other types.
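The inferences for all four types can be collected in a small lookup table. The sketch below is only illustrative: it assumes the type-to-location mapping that Alg. 1 (Section V) and Examples 2 and 4 imply, namely that types 3 and 4 place $J_i$ below the branching point and types 2 and 4 place $J_j$ below it. The names `QUARTET_TYPES` and `describe` are hypothetical and not from the paper.

```python
# For each quartet type: (is J_i below the branching point?, is J_j below it?).
# Assumption: this mapping is read off Alg. 1 and Examples 2 and 4,
# not taken directly from Fig. 2.
QUARTET_TYPES = {
    1: (False, False),  # both paths join above the branching point, so J_i = J_j
    2: (False, True),   # only J_j lies below the branching point
    3: (True, False),   # only J_i lies below the branching point
    4: (True, True),    # both joining points lie below the branching point
}


def describe(qtype):
    """Return a one-line description of what a quartet type implies."""
    below_i, below_j = QUARTET_TYPES[qtype]

    def side(below):
        return "below" if below else "above"

    text = f"J_i {side(below_i)}, J_j {side(below_j)} the branching point"
    if not (below_i or below_j):
        text += "; J_i = J_j"
    return text


for qtype in sorted(QUARTET_TYPES):
    print(qtype, describe(qtype))
```

A table of this kind is all a merging algorithm needs from a query: which of the two joining points, if any, can be pinned to the last link of its receiver.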

Problem Statement. Consistently with [1], we assume that
$G_{S_1 \times \mathcal{R}}$ (i.e., the 1-by-N tree topology rooted at $S_1$, which
contains only branching points) is known (e.g., using one of
the methods in ll4ll7l4l5lfl8ll). We also assume that the type of
the quartet between $S_1$, a new source $S_2$, and any two receivers
can be queried and learned, as explained above.

Given (i) $G_{S_1 \times \mathcal{R}}$ and (ii) the ability to query the quartet type
between $S_1$, $S_2$, and any two receivers $R_i$, $R_j$, our goal is to
identify all joining points, $\mathcal{J}_N = \{J_1, J_2, \dots, J_N\}$, where the
paths from $S_2$ to each receiver join the tree describing paths
from $S_1$ to the same set of receivers. Identifying a joining point
$J_i$ (for receiver $R_i$) means locating $J_i$ on a single logical link,
between two branching points on $G_{S_1 \times \mathcal{R}}$. E.g., in Fig. 1, the
path from $S_2$ to $R_1$ joins the $S_1$ tree at a point between nodes
$B_{1,3}$ and $B_{1,2}$, i.e., $J_1$ is located on the link $(B_{1,3}, B_{1,2})$.

We achieve this goal via active learning: we start from
the given, static, 1-by-N topology $G_{S_1 \times \mathcal{R}}$ and proceed by
updating it in steps. In each step, we select which quartet to
query (i.e., which two receivers to send probes to, from sources
$S_1, S_2$) and learn its type (after sending and processing the
received probes, we have essentially queried and learned the
type of that quartet). We then merge this quartet with the
topology known so far. We continue until the entire 2-by-N
topology is identified. The goal is to exactly identify the 2-by-N
topology while minimizing the number of queries (i.e., sets of
probes sent to measure the quartets). This metric is important
because it directly translates into measurement bandwidth.
Additional performance metrics that should be kept low include
merging complexity and memory usage.

IV. Lower Bound 

First, we provide a lower bound on the number of quartets 
required by any active learning algorithm to infer the 2-by-N.
It clearly depends on the topology we want to infer and serves 
as a baseline for the performance of the proposed algorithms. 

^ Since we focus on M = 2, i.e., only two sources $S_1$ and $S_2$, we represent
the quartets $(S_1, S_2, R_i, R_j)$ only by the receivers $(R_i, R_j)$ for brevity.

Fig. 3. Two example 2-by-N topologies with N = 4: (a) two quartets are sufficient; (b) three quartets are required. In (a), $N/2$ quartets are
sufficient to identify the joining points, i.e., $(R_1, R_2)$ and $(R_3, R_4)$. In (b),
more than $N/2$ quartets are required, e.g., $(R_1, R_2)$, $(R_1, R_3)$, and $(R_1, R_4)$.



Theorem 4.1: Given $G_{S_1 \times \mathcal{R}}$, the number of quartets
required to be queried by any algorithm in order to identify
all the joining points in $G_{\mathcal{S} \times \mathcal{R}}$, $\mathcal{S} = \{S_1, S_2\}$, is at least $\lceil N/2 \rceil$.

Before proving the theorem, let us discuss some examples 
that illustrate the intuition and that this bound is not tight. 

Example 2: Fig. 3(a) shows a 2-by-N topology with N = 4,
which requires querying exactly $N/2 = 2$ quartets in order
to uniquely identify all the joining points. This is because, in
this particular topology, knowing the types of $(R_1, R_2)$ and
$(R_3, R_4)$ is sufficient for identifying all four joining points.
Indeed, $(R_1, R_2)$ is of type 4, which, according to Fig. 2,
means that both $J_1$ and $J_2$ lie below $B_{1,2}$; also $(R_3, R_4)$ is
type 4, which means that both $J_3$ and $J_4$ are below $B_{3,4}$. Thus,
each joining point is identified on a single logical link. ■

Example 3: Fig. 3(b) shows an example where $N/2 = 2$
quartets are not sufficient and 3 quartets are needed to identify
all joining points. There exist $\binom{4}{2} = 6$ possible quartets in
this topology, from which $\binom{6}{2} = 15$ pairs of quartets can be
selected; one can check that none of the 15 possible pairs
can uniquely identify all joining points. For example, let us
consider $(R_1, R_2)$. Since it is of type 1, Fig. 2 indicates that
$J_1 = J_2$ and both of them lie above $B_{1,2}$. But there is more
than a single link above $B_{1,2}$; thus we continue by considering
$(R_1, R_3)$. It is again of type 1, which means that $J_1 = J_3$ is
located above $B_{1,3}$. Thus, we go one step further and consider
$(R_1, R_4)$. Since this is also type 1, $J_1 = J_4$ lies above $B_{1,4}$.
At this step, we only have a single link between $S_1$ and $B_{1,4}$
and thus, $J_1 = J_2 = J_3 = J_4$ are all identified (depicted as
J in Fig. 3(b)). Although there are other choices of triplets of
quartets, in this topology, at least 3 quartets are required. ■

From these examples, one can see that the lower bound of
$\lceil N/2 \rceil$ is not tight and it is not achievable in every topology.
Theorem 4.1 follows from the following lemma.

Lemma 4.2: In order for an algorithm to identify all joining 
points for all the receivers, each receiver needs to appear in 
the set of quartets queried by the algorithm at least once. 

Proof: Assume that there exists a receiver $R_i$ that has not
been queried in any of the quartets. We show that even with
complete knowledge of all other joining points, there exist at
least two possible and feasible locations for $J_i$, as follows.

Location 1: $J_i$ lies on the last incoming link to $R_i$, i.e., on
the link between the parent of $R_i$ in the $S_1$ tree (which from
now on, we denote by $parent(R_i)$), and $R_i$. For example in
Fig. 3(a) and Fig. 3(b), assume that $R_i = R_2$; then Location
1 would be the link $(B_{1,2}, R_2)$. This is allowed by the routing
assumptions in Section III because (1) there is a unique path
$P_{2i}$; (2) $P_{2i}$ never merges with $P_{2j}$, $j \neq i$; and (3) $P_{2i}$ merges
with $P_{1i}$ at $J_i$, and they continue together until they reach $R_i$.

Fig. 4. Deletion and contraction of edge $e_4$ in a graph.

Location 2: Define $J_i$ as follows. On path $P_{1i}$, start at
$parent(R_i)$ and move up towards $S_1$, until the first link that
does not fully overlap with any $P_{2j}$, $j \neq i$. Place $J_i$ on that
link. For example in Fig. 3(a), Location 2 for $J_2$ would be the
link $(B_{1,3}, B_{1,2})$; whereas in Fig. 3(b), it would be $(S_1, B_{1,4})$.
This location is also allowed by the assumptions in Section III:
A1 There is a unique path $P_{2i}$.
A2 For every $j \neq i$, the two paths $P_{2i}$ and $P_{2j}$ never join
   after they branch. Indeed, if $J_j$ is located above $J_i$ on
   $P_{1i}$, then this is guaranteed by the construction of $J_i$. In
   contrast, $J_j$ cannot be located below $J_i$ on $P_{1i}$ since this
   would imply the violation of A2 even before adding $J_i$.
A3 $P_{2i}$ merges with $P_{1i}$ at $J_i$ and they never split.
Thus, both Location 1 and Location 2 are valid for $J_i$,
according to the routing assumptions, and $J_i$ cannot be uniquely
identified. Therefore, $R_i$ needs to be queried at least once. ■

Theorem 4.1 follows from the following reasoning: each
quartet involves two receivers, and thus, at least $\lceil N/2 \rceil$ quartets
are required for each receiver to appear in the set of quartets
queried by the algorithm at least once.
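As a quick numerical illustration of how this bound compares with the other query counts mentioned in the paper, the sketch below tabulates the lower bound $\lceil N/2 \rceil$, the N - 1 queries of the bottom-up algorithm, and the $\binom{N}{2}$ queries of an exhaustive approach. It also checks the necessary condition of Lemma 4.2 for a candidate set of quartets. The function names are hypothetical, and the coverage check is only necessary, not sufficient (Example 3 shows a case where covering every receiver is not enough).

```python
from math import comb


def query_counts(n_receivers):
    """Query counts discussed in the text, for N receivers and two sources."""
    return {
        "lower_bound": (n_receivers + 1) // 2,   # ceil(N / 2), Theorem 4.1
        "bottom_up": n_receivers - 1,            # N - 1 quartets (Section V)
        "exhaustive": comb(n_receivers, 2),      # all receiver pairs
    }


def covers_all_receivers(quartets, receivers):
    """Necessary condition of Lemma 4.2: every receiver appears in a quartet."""
    seen = {r for pair in quartets for r in pair}
    return seen >= set(receivers)


receivers = ["R1", "R2", "R3", "R4"]
print(query_counts(len(receivers)))
print(covers_all_receivers([("R1", "R2"), ("R3", "R4")], receivers))
print(covers_all_receivers([("R1", "R2"), ("R1", "R3")], receivers))
```

For N = 4 the three counts are 2, 3, and 6, which matches Examples 2 and 3: two quartets can suffice, three may be needed, and six is the exhaustive total.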

V. A Bottom-Up Greedy Algorithm 

In this section, we design a greedy algorithm that, given
$G_{S_1 \times \mathcal{R}}$ and the ability to query the type of any quartet,
identifies all N joining points where $G_{S_2 \times \mathcal{R}}$ merges
with $G_{S_1 \times \mathcal{R}}$, i.e., the entire 2-by-N topology, in N - 1 steps.

Let every edge e in $G_{S_1 \times \mathcal{R}}$ have a unique name: label(e).
In our algorithm, we use two operations, "edge deletion" and
"edge contraction", depicted in Fig. 4 and defined as follows.

Definition 1: Deleting edge (u, v) entails taking that edge
out of the graph while the end-nodes u and v, and the labels
of the remaining edges in the graph, remain unchanged.

Definition 2: Contracting edge (u, v) into node w consists
of deleting that edge and merging u and v into a single node
w. The labels of the remaining edges do not change (although
nodes may be renamed to w).
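The two operations can be sketched on a graph stored as a dictionary from edge labels to (tail, head) pairs. This representation, the function names, and the small example graph are assumptions made for illustration; they are not the graph of Fig. 4.

```python
def delete_edge(edges, lab):
    """Definition 1: drop one edge; end-nodes and other labels are untouched."""
    return {l: uv for l, uv in edges.items() if l != lab}


def contract_edge(edges, lab, w):
    """Definition 2: drop the edge and merge its two end-nodes into node w."""
    u, v = edges[lab]

    def rename(node):
        return w if node in (u, v) else node

    return {l: (rename(a), rename(b)) for l, (a, b) in edges.items() if l != lab}


# Hypothetical example: a root "a" with one child "b", which has two leaves.
edges = {"e1": ("a", "b"), "e2": ("b", "c"), "e3": ("b", "d")}
print(delete_edge(edges, "e1"))         # e2 and e3 keep their end-nodes
print(contract_edge(edges, "e1", "w"))  # e2 and e3 now start at the merged node w
```

In both cases the surviving edges keep their labels, which is what lets the algorithm report joining points in terms of the labels of the original tree.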

The algorithm is described in Alg. 1. It starts from the
$S_1$ tree ($G_{S_1 \times \mathcal{R}}$) and proceeds by selecting one quartet to
query at each step (i.e., 2 receivers $R_i, R_j$ to send probes
to, from sources $S_1, S_2$). The two receivers $(R_i, R_j)$ in the
selected quartet are sibling leaves. Based on the type of the
selected quartet, Alg. 1 identifies exactly one joining point in
one step. It then updates $G_{S_1 \times \mathcal{R}}$ by deleting the receiver whose
joining point has been identified and the last incoming edge to
that receiver. Furthermore, if a node of degree two appears
in $G_{S_1 \times \mathcal{R}}$ as a result of this edge deletion, the algorithm
eliminates that node by contracting the corresponding edge.
The algorithm continues iteratively until there is one edge left,
i.e., all joining points are identified. This way, Alg. 1 identifies
all joining points (where paths from $S_2$ to each receiver join
the $S_1$ tree), one-by-one, proceeding from the bottom to the
root of the tree. Next, we describe an illustrative example.

Fig. 5. The steps (b), (c), (d), and (e), performed by Alg. 1 to identify the 2-by-N topology in (a). The output of the algorithm is $J = [e_2, e_3, e_3, e_1]$.
(a) The $G_{\mathcal{S} \times \mathcal{R}}$ topology, which we want to identify.
(b) $G_{S_1 \times \mathcal{R}}$ ($T_4$). $(R_2, R_3)$ is of type 1; thus $J_2 = J_3$.
(c) $T_3$. $(R_1, R_3)$ is of type 4; thus $J_3$ is identified on $e_3$.
(d) $T_2$. $(R_1, R_4)$ is of type 3; thus $J_1$ is identified on $e_2$.
(e) $T_1$. $R_z = R_4$; thus $J_4$ is identified on $e_1$.

Algorithm 1 Bottom-up merging algorithm: it starts from
$G_{S_1 \times \mathcal{R}}$, selects the quartets sequentially, queries their types,
and merges them until identifying all joining points $\mathcal{J}_N$.
 1: Let J be a vector of length N of edge labels, which represents the
    locations of the joining points.
 2: while $|\mathcal{R}| > 1$ do
 3:   Pick any two receivers $R_i, R_j$ in $G_{S_1 \times \mathcal{R}}$, such that $R_i$ and $R_j$ are
      siblings; denote their parent by P.
 4:   Query the type of $(R_i, R_j)$.
 5:   switch type of $(R_i, R_j)$ do
 6:     case type 1:
 7:       $J_i = J_j$
 8:       Delete $R_i$ and edge $(P, R_i)$.
 9:       if outdeg(P) == 1 then
10:         Contract $(P, R_j)$ into $R_j$.
11:     case type 2:
12:       $J_j = label((P, R_j))$
13:       Delete $R_j$ and edge $(P, R_j)$.
14:       if outdeg(P) == 1 then
15:         Contract $(P, R_i)$ into $R_i$.
16:     case type 3:
17:       $J_i = label((P, R_i))$
18:       Delete $R_i$ and edge $(P, R_i)$.
19:       if outdeg(P) == 1 then
20:         Contract $(P, R_j)$ into $R_j$.
21:     case type 4:
22:       $J_j = label((P, R_j))$
23:       Delete $R_j$ and edge $(P, R_j)$.
24:       if outdeg(P) == 1 then
25:         Contract $(parent(P), P)$ into P.
    /* There is one remaining receiver, which we call $R_z$. */
26: Let $J_z = label((parent(R_z), R_z))$.
27: Output J.

Example 4: Fig. 5(b)-(e) demonstrate the steps performed
by Alg. 1 to identify the 2-by-N topology shown in Fig. 5(a).
The algorithm starts from $G_{S_1 \times \mathcal{R}}$ shown in Fig. 5(b); $e_1, \dots, e_6$
are the edge labels on this tree. The algorithm first selects
$(R_2, R_3)$ and queries its type. Since the answer is type 1, the
algorithm assigns $J_2 = J_3$, and deletes $R_2$ and $e_5$. Since the
degree of $B_{2,3}$ becomes 2, the algorithm contracts $e_6$ into $R_3$.

In the second step shown in Fig. 5(c), Alg. 1 selects two
sibling leaves $(R_1, R_3)$, randomly out of three possible pairs of
siblings, and queries its type. Since it is type 4, the algorithm
identifies $J_3$ on $e_3$ (which, together with the previous step,
means that $J_2$ is also identified). It also deletes $R_3$ and $e_3$.
There is no contraction in this step as the degree of $B_{1,3}$ is > 2.

In the third step shown in Fig. 5(d), $(R_1, R_4)$ is selected
and queried; it is of type 3. Therefore, the algorithm identifies
$J_1$ on $e_2$, deletes $R_1$ and $e_2$, and contracts $e_4$ into $R_4$. Since
there is only one receiver left, there are no more quartets to
query; thus the algorithm exits the while loop and proceeds to
the last step (line 26). For $R_z = R_4$, the algorithm identifies
$J_4$ on $e_1$, as shown in Fig. 5(e). The identified joining points
agree with the real locations in the $G_{\mathcal{S} \times \mathcal{R}}$ topology in Fig. 5(a),
which demonstrates the correctness of the algorithm. ■
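The walk-through above can be reproduced with a short sketch of Alg. 1. Everything about the representation is an assumption made for illustration: the tree is a child-to-parent map with one label per incoming edge, the internal node names `B` and `B23` are hypothetical, the tree shape is reconstructed from the text of Example 4 rather than from Fig. 5 itself, and `make_oracle` merely simulates quartet queries from known joining points (in practice a query means sending probes). Sibling selection is made deterministic here, whereas Alg. 1 allows any sibling pair.

```python
def path_labels(node, parent, label):
    """Edge labels on the path from the root down to `node`."""
    labels = []
    while node in parent:
        labels.append(label[node])
        node = parent[node]
    return labels[::-1]


def make_oracle(parent, label, truth):
    """Simulated quartet query, built from known joining points (testing only)."""
    paths = {r: path_labels(r, parent, label) for r in truth}

    def query(ri, rj):
        shared = set()                      # edges above the branching point
        for a, b in zip(paths[ri], paths[rj]):
            if a != b:
                break
            shared.add(a)
        below = (truth[ri] not in shared, truth[rj] not in shared)
        return {(False, False): 1, (False, True): 2,
                (True, False): 3, (True, True): 4}[below]

    return query


def bottom_up_merge(parent, label, receivers, query):
    parent, label = dict(parent), dict(label)       # work on copies
    children = {}
    for node, p in parent.items():
        children.setdefault(p, set()).add(node)
    remaining, joins, same_as, n_queries = set(receivers), {}, {}, 0

    def depth(node):
        d = 0
        while node in parent:
            node, d = parent[node], d + 1
        return d

    while len(remaining) > 1:
        ri = max(sorted(remaining), key=depth)      # a deepest receiver has a sibling leaf
        p = parent[ri]
        rj = min(children[p] - {ri})
        qtype = query(ri, rj)
        n_queries += 1
        if qtype == 1:                              # J_i = J_j, resolved later
            same_as[ri] = rj
            gone = ri
        elif qtype == 3:                            # J_i is on edge (P, R_i)
            joins[ri] = label[ri]
            gone = ri
        else:                                       # types 2 and 4: J_j on (P, R_j)
            joins[rj] = label[rj]
            gone = rj
        remaining.discard(gone)
        children[p].discard(gone)
        del parent[gone]
        if len(children[p]) == 1 and p in parent:   # P now has out-degree 1
            (kept,) = children.pop(p)
            if qtype != 4:                          # contract (P, kept): upper label survives
                label[kept] = label[p]
            grand = parent.pop(p)                   # type 4 keeps the label of (P, kept)
            parent[kept] = grand
            children[grand].discard(p)
            children[grand].add(kept)
    (rz,) = remaining
    joins[rz] = label[rz]
    for r in receivers:                             # follow type-1 links
        target = r
        while target not in joins:
            target = same_as[target]
        joins[r] = joins[target]
    return joins, n_queries


# Tree reconstructed from Example 4 (node names "B" and "B23" are hypothetical).
parent = {"B": "S1", "R1": "B", "B23": "B", "R4": "B", "R2": "B23", "R3": "B23"}
label = {"B": "e1", "R1": "e2", "B23": "e3", "R4": "e4", "R2": "e5", "R3": "e6"}
truth = {"R1": "e2", "R2": "e3", "R3": "e3", "R4": "e1"}
receivers = ["R1", "R2", "R3", "R4"]
joins, n_queries = bottom_up_merge(parent, label, receivers,
                                   make_oracle(parent, label, truth))
print([joins[r] for r in receivers], n_queries)
```

With this input the sketch queries $(R_2, R_3)$, $(R_1, R_3)$, and $(R_1, R_4)$, the same three quartets as in Example 4, and returns the joining points on $e_2, e_3, e_3, e_1$.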

A. Properties of Algorithm 1

Let $T_N = G_{S_1 \times \mathcal{R}}$ denote the logical tree from $S_1$ to all N
receivers, which we assume to be known. In this section, we
use the notation $T_N$ to emphasize that this initial tree $G_{S_1 \times \mathcal{R}}$
contains N receivers. After each iteration through the while
loop in Alg. 1, one receiver is deleted. We write $T_k$ to denote
the tree (rooted at $S_1$) obtained at the end of iteration (N - k),
at which point there are k receivers remaining. Let $\mathcal{J}_k$ denote
the set of joining points, which still remain to be identified
after iteration (N - k), i.e., one for each remaining receiver.

Proposition 5.1: Let $T_k$ and $\mathcal{J}_k$ be given. The next iteration
of Alg. 1 (lines 3-25) produces $T_{k-1}$ and $\mathcal{J}_{k-1}$, which satisfy
the following properties:

1) The $S_1$ topology is still a logical tree, and it has k - 1
receivers (i.e., one receiver and its corresponding edge are
deleted from $T_k$). Therefore, we denote it by $T_{k-1}$.

2) One joining point has been identified; therefore, the
algorithm has k - 1 more joining points in $\mathcal{J}_{k-1}$ to identify.

3) All joining points in $\mathcal{J}_{k-1}$ are located on edges in $T_{k-1}$.
Proof: These properties follow directly from the operations
performed by one step of Alg. 1.

1) In each iteration, a single receiver is eliminated from the
tree. Consequently, the only node that can possibly have degree
two (or out-degree one) after deleting the receiver is its parent,
P. However, after each deletion, Alg. 1 tests to see if P has
out-degree 1, and if it does, then an additional contraction is
performed so that the resulting tree, $T_{k-1}$, is still logical.

2) When $(R_i, R_j)$ is of type 2, 3, or 4, we can see in lines
12, 17, and 22 of the algorithm, respectively, that one joining
point is identified. When $(R_i, R_j)$ is of type 1, line 7 assigns
to $R_i$ the same joining point as $R_j$'s. Then, in line 8, $R_i$ is
deleted so that we do not create a loop by assigning $J_i$ again
to $J_j$ later. Also, $J_j$ eventually becomes identified, either in
one of the other types (2, 3, or 4) in the while loop, or in the
last line of the algorithm. Thus, we have $\mathcal{J}_{k-1}$ after one step.

3) Alg. 1 changes $T_k$ by 2 processes: edge deletion and edge
contraction. We show that neither deletion nor contraction can
eliminate an edge in $T_k$ that contains a joining point in $\mathcal{J}_{k-1}$.

Deletion: Alg. 1 is constructed s.t. any edge deleted from
the $S_1$ tree contains either no joining point (if $(R_i, R_j)$ is type
1) or exactly one joining point, corresponding to the receiver
being removed along with that edge (if $(R_i, R_j)$ is type 2, 3, 4).

Contraction: An edge is contracted only when it does not
contain any joining point, neither for $R_i$ and $R_j$ (see lines
9-10 for type 1, lines 14-15 for type 2, lines 19-20 for
type 3, and lines 24-25 for type 4), nor for any other receivers
(since $(R_i, R_j)$ are sibling leaves, the contracted edge cannot
contain any joining point for any other receiver). ■

The following theorem establishes the correctness and
complexity of Algorithm 1.

Theorem 5.2: Alg. 1 terminates in N steps and correctly
identifies all N joining points after querying N - 1 quartets.

Proof: The proof is via induction. In the beginning, $T_N =
G_{S_1 \times \mathcal{R}}$ is a logical tree and according to Corollary 1 in [1],
the joining points are identifiable using sufficient quartets. Our
inductive step is one iteration of the while loop. First, note that
there exist two sibling receivers at every step: it is enough to
pick one of the lowest receivers (i.e., a receiver with the largest
distance from the source); it will always have a sibling because
of the logical tree topology. The algorithm queries one quartet
per step, identifies one joining point per step, and at the end
of the step, it preserves properties 1, 2, and 3. The while loop
terminates in N - 1 iterations and there is one additional step
for $R_z$ after the loop (which does not use any quartet). Thus,
the algorithm terminates in N steps, and correctly identifies
all joining points by querying exactly N - 1 quartets. ■

Discussion. An important observation is that the $N - 1$
quartets are not known a priori, but are easily selected in a
sequential way, as needed; this makes Alg. 1 easy to implement
in practice using active probing. Another observation is about
the running time: exactly $N - 1$ quartets need to be queried
(by sending sets of probes). This is much less than the $\binom{N}{2}$
possible quartets queried by a brute-force approach [11], but
higher than the lower bound on the number of quartets required
by any algorithm ($\lceil N/2 \rceil$, Theorem 4.1). Therefore, Alg. 1 is not
optimal, but it is simple, efficient, and provably correct.
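To make the gap between these three counts concrete, the following minimal sketch tabulates them for a few values of $N$. It assumes the lower bound is $\lceil N/2 \rceil$ (each quartet involves two receivers and can therefore reveal at most two joining points); the function name `quartet_counts` is hypothetical and is not taken from the authors' implementation.

```python
from math import ceil, comb


def quartet_counts(n_receivers: int) -> dict:
    """Number of quartets queried for a 2-by-N topology under three strategies."""
    return {
        # Assumed lower bound for any algorithm: ceil(N / 2).
        "lower_bound": ceil(n_receivers / 2),
        # Bottom-up approach (Alg. 1): exactly N - 1 quartets.
        "bottom_up": n_receivers - 1,
        # Brute force: one quartet per pair of receivers.
        "brute_force": comb(n_receivers, 2),
    }


for n in (4, 16, 128):
    print(n, quartet_counts(n))
```

The bottom-up count grows linearly in $N$, within a factor of two of the assumed lower bound, while the brute-force count grows quadratically.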

VI. A Generalized Binary Search Algorithm 

A. Background on GBS 

The GBS problem is defined as follows [24]. Consider
a finite (potentially very large) collection of binary-valued
functions $\mathcal{H}$, called the "hypothesis space", defined on a
domain $\mathcal{X}$, called the "query space". Each $h \in \mathcal{H}$ is a mapping
from $\mathcal{X}$ to $\{+1, -1\}$. Let $|\mathcal{H}|$ denote the cardinality of $\mathcal{H}$,
i.e., the total number of hypotheses. The functions $h \in \mathcal{H}$ are
assumed to be unique, and one function, $h^* \in \mathcal{H}$, produces the
correct binary labeling. $h^*$ is assumed to be fixed but unknown.
The goal is to determine $h^*$ through as few queries from $\mathcal{X}$ as
possible. Thus, the queries need to be selected strategically in
a sequential manner such that $h^*$ is identified as quickly as possible.

Footnote: Alg. 1 selects siblings $R_i$, $R_j$ at each step. Thus, there are
only 2 potential candidates for the joining points that can be identified
at this step: $J_i$, $J_j$.

Selecting an optimal sequence of queries is an NP-complete problem
[29]. A practical heuristic is given by a greedy algorithm called
generalized binary search (GBS). In this section, we develop a GBS
approach to our problem for the following reasons: (i) our problem is
one of active learning and lends itself naturally to the GBS
framework; (ii) GBS is a principled (although not optimal) approach
with provable correctness and performance guarantees
[24]; (iii) GBS can serve as a baseline for comparison with
Alg. 1 in terms of the number of queries and complexity.

At each step, GBS selects a query that results in the most
even split of the hypotheses under consideration into 2 subsets,
responding +1 and -1, respectively, to the query. The correct
response to the query eliminates one of these two subsets from
further consideration. The work in [24] characterizes the worst-case
number of queries required by GBS in order to identify
the correct hypothesis $h^*$. The main result of [24] indicates that
under certain conditions on the query and hypothesis spaces,
the query complexity of GBS (i.e., the minimum number of
queries required by GBS to identify $h^*$) is near-optimal, i.e.,
within a constant factor of $\log_2 |\mathcal{H}|$. The constant depends on
two parameters $c^*$ and $k$, defined in [24], and it is desirable
that they are both as small as possible.
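The greedy selection rule can be illustrated on a toy hypothesis space. The sketch below is not the paper's Alg. 2; it is a generic GBS loop under the assumption that every hypothesis is stored explicitly as a table of ±1 responses, and the threshold functions, the names `gbs_identify`, `hypotheses`, `queries`, and `target` are all hypothetical illustrations.

```python
def gbs_identify(hypotheses, queries, oracle):
    """Greedy GBS: return (index of the surviving hypothesis, queries used).

    hypotheses: list of dicts mapping each query to +1 or -1 (all distinct).
    oracle: callable returning the response of the unknown h* to a query.
    """
    alive = list(range(len(hypotheses)))
    asked = 0
    while len(alive) > 1:
        # Only queries on which the live hypotheses disagree are informative.
        informative = [x for x in queries
                       if len({hypotheses[h][x] for h in alive}) == 2]
        # Most even split: the +1 and -1 responses nearly cancel out.
        best = min(informative,
                   key=lambda x: abs(sum(hypotheses[h][x] for h in alive)))
        answer = oracle(best)
        asked += 1
        # Keep only the hypotheses that agree with the observed response.
        alive = [h for h in alive if hypotheses[h][best] == answer]
    return alive[0], asked


# Toy example: 9 threshold functions h_t(x) = +1 if x >= t else -1.
queries = list(range(8))
hypotheses = [{x: (1 if x >= t else -1) for x in queries} for t in range(9)]
target = 5
found, asked = gbs_identify(hypotheses, queries,
                            lambda x: hypotheses[target][x])
print(found, asked)
```

On threshold functions the greedy rule reduces to ordinary binary search, so the number of queries stays close to $\log_2 |\mathcal{H}|$.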

B. Merging Logical Topologies in the GBS Framework 

In this section, we formulate our problem within the GBS
framework. Consider a set of hypotheses $\mathcal{H}$, where each
hypothesis $h \in \mathcal{H}$ is a configuration that results from placing
each joining point $J_i$ on an arbitrary link in the path $P_{1i}$ in
the tree. The query space $\mathcal{X}$ is the set of all queries for all
the quartets, where each query $x \in \mathcal{X}$ asks about the type of a
quartet $(R_i, R_j)$. Since in our problem, each such query $x$ has
4 possible answers (corresponding to the 4 quartet types), we
need to modify our queries to make them consistent with the
binary functions in the standard GBS framework. We assume
that each query $x$ consists of 4 subqueries, each of which asks
whether $(R_i, R_j)$ is of a specific type (1, 2, 3, or 4) or not:

$$x = \begin{cases}
\text{Is } (R_i, R_j) \text{ of type 1?} \\
\text{Is } (R_i, R_j) \text{ of type 2?} \\
\text{Is } (R_i, R_j) \text{ of type 3?} \\
\text{Is } (R_i, R_j) \text{ of type 4?}
\end{cases}$$

The answer to each such subquery is binary, which is consistent
with the GBS formulation. Of course, not all four
subqueries are always required for a quartet; one would stop
at the first "yes", which reveals the
type of the quartet. Note, however, that we count the number
of queries (not subqueries) as the performance metric of the
GBS algorithm when comparing with Algorithm 1.
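The decomposition of one four-valued query into binary subqueries can be sketched as follows. This is only an illustration of the bookkeeping: the function name `resolve_quartet` is hypothetical, and the true quartet type is passed in directly, whereas in practice each subquery would be answered by end-to-end probes.

```python
def resolve_quartet(true_type: int):
    """Ask "is the quartet of type t?" for t = 1..4, stopping at the first yes.

    Returns (quartet type, number of binary subqueries used). The whole call
    counts as a single query in the comparison with the bottom-up approach.
    """
    if true_type not in (1, 2, 3, 4):
        raise ValueError("quartet type must be 1, 2, 3, or 4")
    used = 0
    for candidate in (1, 2, 3, 4):
        used += 1
        if candidate == true_type:  # binary answer: yes
            return candidate, used
    raise AssertionError("unreachable: one of the four types must match")


print(resolve_quartet(3))
```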

(a) $G_{S_1 \times \mathcal{R}}$, star topology. (b) $G_{S_1 \times \mathcal{R}}$, perfect binary tree. (c) $G_{S_1 \times \mathcal{R}}$, tall binary tree. (d) $G_{S_1 \times \mathcal{R}}$, perfect ternary tree.

Fig. 6. Four synthetic $G_{S_1 \times \mathcal{R}}$ topologies used to compare the performance of Alg. 1 (the bottom-up approach) with Alg. 2 (the GBS approach).

Algorithm 2 GBS algorithm for identifying the joining points.

 1: Let J = [0, 0, ..., 0] be a vector of length N, which represents the
    locations of the joining points.
 2: while ∃ 0 in J do
 3:   Let wcB = [] represent the worst case benefits for all the quartets.
 4:   for each receiver R_i do
 5:     for each receiver R_j, j > i do
 6:       Let B_ij be the lowest common ancestor of R_i, R_j in G_{S1×R}
 7:       Let up_i ⊆ P_1i be the subset of P_1i located above B_ij
 8:       Let up_j ⊆ P_1j be the subset of P_1j located above B_ij
 9:       Let dn_i ⊆ P_1i be the subset of P_1i located below B_ij
10:       Let dn_j ⊆ P_1j be the subset of P_1j located below B_ij
11:       type1_B = …
12:       type2_B = …
13:       type3_B = …
14:       type4_B = …
15:       wcB.append(max([type1_B, type2_B, type3_B, type4_B]))
16:   selectedQuartet = wcB.index(min(wcB))
17:   Let selectedQuartetType be the type of selectedQuartet.
18:   switch selectedQuartetType do
19:     case type 1
20:       P_1i = …
21:       P_1j = …
22:     case type 2
23:       P_1i = …
24:       … dn_j
25:     case type 3
26:       …
27:       …
28:     case type 4
29:       …
30:       …
31:   if |P_1i| == 1 then
32:     J_i = P_1i
33:   if |P_1j| == 1 then
34:     J_j = P_1j
35: Output J.

Our goal is to find the target hypothesis $h^*$, which is the
configuration that results from the correct placement of the
joining points in the $S_1$ topology, using as few queries (i.e.,
the knowledge of as few quartet types) as possible.

Footnote: More formally, $h^*$ answers every query, for any pair of receivers, in
accordance with the true 2-by-$N$ topology. Mathematically, $h^*$ is a mapping
from queries to $\{+1, -1\}$, not a topology itself. However, there is a bijection
between all 2-by-$N$ logical topologies and the corresponding mappings in $\mathcal{H}$, and
therefore, knowing $h^*$ is equivalent to knowing the 2-by-$N$ topology.

Alg. 2 describes a greedy strategy based on GBS for determining
$h^*$. In the beginning, there are $|\mathcal{H}|$ possible hypotheses.
In each step, the algorithm selects the best (i.e., maximally
discriminating [24]) quartet to query as follows. By querying
a quartet and learning its type, some information is obtained
about the locations of two joining points. Thus, the number of
feasible hypotheses, which agree with the constraints imposed
by the quartets queried and learned so far, is reduced by a
number, which depends on the topology in general. We call this
number the benefit of the quartet. The best quartet to select to
query is the one with maximum benefit. However, the benefit of
each quartet becomes known only after it is queried. Thus, the
algorithm considers all four possible types for every possible
quartet, and focuses on the worst case benefit of that quartet,
i.e., the type that gives the minimum benefit. The best quartet
to query is the one with maximum worst case benefit.

We denote the benefit of each type for a quartet $(R_i, R_j)$
by type1_B, ..., type4_B in Alg. 2 and define it as follows.
Each quartet type limits the set of candidate edges on which
$J_i$ and $J_j$ can be located, in the way depicted in Fig. 2.
The benefit of a type for $(R_i, R_j)$ is measured by the ratio of the
number of edges on which $J_i$ and $J_j$ can still be located after
learning this type to the current number of candidate
edges for the locations of $J_i$ and $J_j$; a smaller ratio means a larger
reduction. The worst case (minimum)
benefit of $(R_i, R_j)$ results from the type for which this ratio
is maximized, and the maximum of these worst case benefits
over all quartets is given by the quartet with the smallest such ratio.
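The selection rule described above is a minimax over ratios, which can be sketched in a few lines. The sketch assumes the four per-type ratios of every quartet have already been computed; the function name `select_quartet` and the numbers in `example_ratios` are hypothetical and serve only to illustrate the rule.

```python
def select_quartet(ratios):
    """Pick the quartet whose worst-case ratio is smallest.

    ratios: dict mapping a receiver pair (i, j) to four ratios, one per
    quartet type, each equal to (candidate edges left after learning that
    type) / (current candidate edges).
    """
    # Worst case for a quartet: the type that leaves the most candidates.
    worst_case = {pair: max(per_type) for pair, per_type in ratios.items()}
    # Best quartet: the one whose worst case is the smallest.
    return min(worst_case, key=worst_case.get)


# Hypothetical ratios for three quartets.
example_ratios = {
    (1, 2): [0.5, 0.9, 0.9, 0.5],
    (1, 3): [0.5, 0.6, 0.6, 0.5],
    (2, 3): [1.0, 0.25, 0.25, 0.25],
}
print(select_quartet(example_ratios))
```

Quartet (2, 3) has the best single outcome (ratio 0.25) but may also yield no reduction at all (ratio 1.0), so the worst-case rule prefers (1, 3).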

In order to provide an analytical upper bound on the number
of quartets required by Alg. 2, one can try to use the main
result of [24], which indicates that Alg. 2 requires $\log_2 |\mathcal{H}|$
quartets. However, we cannot compute $|\mathcal{H}|$ exactly in our
problem; we can only provide a loose upper bound on it,
which is $N!$. Thus, we get the bound of $\log N! \approx N \log N$ on
the number of quartets required by Alg. 2, which is loose, and
much larger than the $N - 1$ quartets of Alg. 1. The next section
evaluates the performance of Algorithms 1 and 2 via simulation.
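The looseness of this bound is easy to check numerically. The sketch below compares $\log_2 N!$ (computed through the log-gamma function to avoid huge integers) with the Stirling-type approximation $N \log_2 N$ and with the $N - 1$ quartets of the bottom-up approach; the helper name `log2_factorial` is hypothetical.

```python
from math import lgamma, log, log2


def log2_factorial(n: int) -> float:
    """log2(n!) computed as ln(Gamma(n + 1)) / ln(2)."""
    return lgamma(n + 1) / log(2)


for n in (4, 16, 128):
    print(n, round(log2_factorial(n), 1), round(n * log2(n), 1), n - 1)
```

For $N = 128$, the bound is several hundred quartets, compared with 127 for the bottom-up approach.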

VII. Performance Evaluation 

A. Simulation Setup 

We evaluate the two algorithms in simulations over both
synthetic topologies (as shown in Fig. 6) and realistic topologies
(as shown in Fig. 7). We compare them to each other as
well as to the lower bound. The main performance metric of
interest is the number of quartets queried in order to exactly
infer the topology, which directly translates into measurement
overhead. Additional metrics include the running time and the
memory used by each algorithm.

Footnote: This is the best case, where the constants $c^*$ and $k$ in [24] are both as small
as possible. In practice, there is an additional constant factor for $\log_2 |\mathcal{H}|$.

Footnote: The bound is obtained by starting from the $S_1$ tree and considering all
possible placements of $J_i$ on $P_{1i}$, $\forall i$. Fig. 6(c) shows that there are
$N \times (N - 1) \times \cdots \times 2 = N!$ possible such placements. In practice, the routing
assumptions in Section III impose some constraints on possible $J_i$ locations.
Also, the type of each quartet may rule out some types for the other quartets.
Therefore, the exact $|\mathcal{H}|$ depends on the topology and we cannot compute it.

(b) A 2-by-16 topology from Exodus [23].



Fig. 7. Two realistic 2-by-$N$ topologies used to compare the performance of
Alg. 1 (Bottom-Up) with Alg. 2 (GBS). Solid lines indicate the paths taken
by probes from $S_1$; dashed lines indicate the paths taken by probes from $S_2$.

For the synthetic topologies, we illustrate only the 1-by-$N$
tree topology of $S_1$ in Fig. 6. We consider the star topology,
"perfect" and "tall" binary trees (referring to the topologies
depicted in Fig. 6(b) and 6(c), respectively), and perfect ternary
trees, for the $G_{S_1 \times \mathcal{R}}$ tree topology. Starting from this tree, we
then create a 2-by-$N$ topology, with sources $S_1$ and $S_2$, by
choosing the location of each joining point $J_i$ (for receiver
$R_i$) on a single logical link, selected uniformly at random, on
$P_{1i}$ in $G_{S_1 \times \mathcal{R}}$. For each $G_{S_1 \times \mathcal{R}}$ in Fig. 6, we consider 100
realizations of such random placements (resulting in different
2-by-$N$ topologies) and report the average number of quartets
required for these topologies in the next section.
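One way to generate such random realizations is sketched below for the perfect binary tree. This is an illustration only and not the authors' released code: nodes are heap-indexed, the source $S_1$ is assumed to attach to the root through one extra logical link, each joining point is drawn independently, and the routing constraints mentioned in the paper are ignored. The names `perfect_binary_paths` and `place_joining_points` are hypothetical.

```python
import random


def perfect_binary_paths(depth: int) -> dict:
    """Map each leaf receiver to the list of logical links from S1 down to it.

    Nodes are heap-indexed (root = 1, children of v are 2v and 2v + 1).
    Assumption: S1 connects to the root through one extra link ("S1", 1).
    """
    paths = {}
    first_leaf = 2 ** depth
    for leaf in range(first_leaf, 2 * first_leaf):
        links, node = [], leaf
        while node > 1:
            links.append((node // 2, node))  # link from parent to node
            node //= 2
        links.append(("S1", 1))
        paths[leaf] = links[::-1]            # order links from the source down
    return paths


def place_joining_points(paths: dict, rng: random.Random) -> dict:
    """Place each joining point on a link chosen uniformly at random on its path."""
    return {receiver: rng.choice(links) for receiver, links in paths.items()}


paths = perfect_binary_paths(3)              # 8 receivers
rng = random.Random(0)                       # seeded for reproducibility
realizations = [place_joining_points(paths, rng) for _ in range(100)]
print(len(paths), len(realizations))
```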

For the realistic topologies, we show the complete 2-by-$N$
topology in Fig. 7. Fig. 7(a) depicts a US university
departmental LAN with 16 receivers, first used in [11]. Fig. 7(b)
is a 2-by-16 directed acyclic graph (DAG), extracted from
the Exodus topology, which is a large commercial ISP whose
backbone map was inferred by the Rocketfuel project [23].
To generate this topology, we randomly picked two nodes of
Exodus (nodes 5 and 36) to be the sources, and selected as
receivers all sixteen nodes to which both sources had routes.
We then found the shortest path trees from each source to the
receivers, and considered the overlap between these two trees.

Our experiments are conducted using Python implementations
of Algorithms 1 and 2, which we have made available
online [30]. They take as input any topology and return the
number of quartets required by the two algorithms. Next, we
summarize the simulation results.

B. Simulation Results (for the Number of Quartets) 

When $G_{S_1 \times \mathcal{R}}$ is a star topology as depicted in Fig. 6(a),
Alg. 2 always identifies the 2-by-$N$ topology by querying only
$\lceil N/2 \rceil$ quartets, which is the lower bound; thus, it is optimal and
performs better than Alg. 1, which requires $N - 1$ quartets.



Comparison of Bottom-Up and GBS algorithms in perfect binary trees

Fig. 8. Simulation results for the average number of quartets required by
Alg. 2 (GBS) to infer the 2-by-$N$ topology when $G_{S_1 \times \mathcal{R}}$ is a perfect binary tree
(Fig. 6(b)) of various sizes, $N = 4, \ldots, 128$. The results are averaged over
100 realizations of random placements of the joining points. The standard
deviation error bars (not shown) are comparable with the marker size.

When $G_{S_1 \times \mathcal{R}}$ is a perfect binary tree as shown in Fig. 6(b),
Alg. 2 requires different numbers of quartets, between $N/2$
and $N$, in different 2-by-$N$ topologies. However, as shown
in Fig. 8, on average, Alg. 2 performs very close to Alg. 1,
while being much more complex than Alg. 1.

Similar results are obtained for tall binary trees and perfect
ternary trees. Due to lack of space, we omit the figures and
only report the results. When $G_{S_1 \times \mathcal{R}}$ is a tall binary tree as shown
in Fig. 6(c), the number of quartets required by Alg. 2 varies
depending on the quartet types in different 2-by-$N$ topologies,
but in our simulations on tall binary trees with $N > 100$
receivers, we observe that in at least 80% of the realizations,
Alg. 2 requires the same number of quartets as Alg. 1. This
percentage increases up to 99% in topologies with $N < 100$.
When $G_{S_1 \times \mathcal{R}}$ is a perfect ternary tree, again on average, Alg. 2
performs close to Alg. 1, but for some topologies, Alg. 2
requires even more than $N$ quartets.

For the realistic topologies in Fig. 7(a) and 7(b), Alg. 2
identifies both 2-by-16 topologies by querying 14 ($= N - 2$)
quartets, while Alg. 1 requires $N - 1 = 15$ quartets.

Thus, in our simulations, we find that Alg. 2 requires
significantly fewer quartets than Alg. 1 only for flat $G_{S_1 \times \mathcal{R}}$
topologies, such as the star in Fig. 6(a). In other topologies, such as
binary/ternary trees or realistic topologies, Alg. 1 is generally
preferred over Alg. 2 because it is simpler and identifies the
joining points using about the same number of quartets as Alg. 2
(i.e., $N - 1$), or even fewer quartets in large topologies.

C. Time and Space Complexity 

1) Time Complexity: The time complexity of Alg. 2
(a higher-degree polynomial in $N$) is significantly higher than that
of Alg. 1 ($O(N)$). The reason is that at each step, Alg. 1 only needs
to select a pair of sibling receivers (any such pair will do), while Alg. 2
calculates the worst case benefits of all the quartets in order
to pick the best one among them, which takes much longer.

As an example, for a single realization of our simulations
when $G_{S_1 \times \mathcal{R}}$ is a perfect binary tree with 128 receivers, the
running time of Alg. 2 is 19 seconds, while that of Alg. 1 is
< 1 second. This is a big difference when we consider a large
number of realizations as described in the previous section.

2) Memory Usage: The memory requirement of Alg. 2 is
also much higher than that of Alg. 1. The reason is that Alg. 1
only needs to store the (modified version of the) graph at
each step, while Alg. 2 needs to keep track of all the benefits
and the worst case benefits for all the quartets, all the path
updates for the location of each joining point, and so forth.

VIII. Extensions 

Due to lack of space, we only briefly outline possible
extensions in this section; a thorough description will be given
in a later technical report/journal version.

A. Extension to M-by-N Topologies 

So far, we have focused on inferring a 2-by-$N$ topology,
which is a special but important case. $M$-by-$N$ topologies can
be inferred by merging one source-rooted tree topology at a
time. Assume that we have inferred a $k$-by-$N$ topology, $2 \le
k < M$. To add the $(k+1)$-th source, we need to identify each
joining point of $S_{k+1}$ and $S_i$, $1 \le i \le k$, for each receiver, on
a single logical link in the $k$-by-$N$ topology (defined by all
the branching points). Therefore, we need to apply Alg. 1 (or
Alg. 2) to $S_{k+1}$ and any one (in the best case) or all (in the
worst case) of the current $k$ sources. Thus, for example using
Alg. 1, the number of quartets required to identify the $M$-by-$N$
topology is between $(M - 1)(N - 1)$ and $\binom{M}{2}(N - 1)$.
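The two ends of this range follow from summing the per-source cost: adding the $(k+1)$-th source costs between 1 and $k$ runs of Alg. 1, each using $N - 1$ quartets, for $k = 1, \ldots, M - 1$. The short sketch below checks the closed forms against that sum; the function name `m_by_n_quartet_range` is hypothetical.

```python
from math import comb


def m_by_n_quartet_range(m_sources: int, n_receivers: int):
    """Best-case and worst-case quartet counts when merging with Alg. 1."""
    per_run = n_receivers - 1
    best = (m_sources - 1) * per_run      # one run per added source
    worst = comb(m_sources, 2) * per_run  # k runs when adding source k + 1
    return best, worst


# The worst case equals the sum of k runs over k = 1, ..., M - 1.
m, n = 5, 16
assert m_by_n_quartet_range(m, n)[1] == sum(k * (n - 1) for k in range(1, m))
print(m_by_n_quartet_range(m, n))
```

For $M = 2$ both ends coincide at $N - 1$, which recovers the 2-by-$N$ case.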

B. Extension to Noisy Case 

So far, we have considered the noiseless scenario, where
the answer to each query is the correct quartet type. One can
extend the algorithms to deal with noisy queries, using the two
approaches proposed in [24]. The first one is a simple solution
that applies to both Alg. 1 and Alg. 2: it repeats the query
multiple times and considers the majority vote as the answer
to that query. The second approach is more sophisticated and
fits naturally in the GBS framework. It assigns weights to
each hypothesis using a probability distribution. The initial
weighting is uniform, but it gets updated after each query.
The update naturally boosts the probability measure of the
hypotheses that agree with the answer to the query. At the
end, the hypothesis with the largest weight is selected. We can
adopt this approach for Alg. 2 by incorporating the probability
measures in the path updates and in computing the benefits.
Using this approach, Alg. 2 can handle the noisy queries more
naturally than Alg. 1. The query complexity and probability
of error of both approaches have been analyzed in [24].
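Both ideas can be sketched in a few lines. The code below is an illustration under stated assumptions rather than the procedure analyzed in [24]: `majority_query` repeats a noisy query and takes the most frequent answer, and `reweight` applies one common multiplicative rule in which agreeing hypotheses are scaled by $1 - \beta$ and disagreeing ones by $\beta$, with $\beta < 1/2$ an assumed noise parameter. All names are hypothetical.

```python
from collections import Counter


def majority_query(noisy_oracle, repeats: int):
    """Repeat a noisy query and return the most frequent answer."""
    votes = Counter(noisy_oracle() for _ in range(repeats))
    return votes.most_common(1)[0][0]


def reweight(weights, predictions, answer, beta: float):
    """Boost hypotheses that agree with the answer, then renormalize.

    weights: current probability of each hypothesis.
    predictions: response (+1 or -1) each hypothesis gives to the query.
    beta: assumed noise parameter, 0 < beta < 0.5.
    """
    scaled = [w * ((1 - beta) if p == answer else beta)
              for w, p in zip(weights, predictions)]
    total = sum(scaled)
    return [s / total for s in scaled]


# Scripted noisy answers for one quartet whose true type is 2.
answers = iter([2, 2, 3, 2, 1])
print(majority_query(lambda: next(answers), repeats=5))
print(reweight([0.5, 0.5], [+1, -1], answer=+1, beta=0.25))
```

The majority vote spends several probes per query, whereas the reweighting spends one probe per query and defers the decision to the final weights.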

IX. Conclusion 

Although active topology inference is a well-studied problem,
to the best of our knowledge, this paper is the first to
focus on efficient merging algorithms. We propose a greedy
bottom-up approach that queries only $N - 1$ quartets, which
is much less than the $\binom{N}{2}$ possible quartets. We also formulate
the problem as multiple hypothesis testing and develop an
active learning algorithm based on GBS. Comparing the two
proposed algorithms in simulation, we find that the simple
bottom-up algorithm is near-optimal, and comparable to the
GBS baseline in terms of the number of queries (thus measurement
bandwidth), while having much lower time and space
complexity; therefore it is preferable for all practical purposes.

Footnote: For the weighted approach of Section VIII-B, a similar solution
for Alg. 1 would be to perform the deletions and contractions probabilistically.

In future work, it would be interesting to fully develop
the possible extensions outlined in Section VIII and also to
compare our algorithms against the optimal solution, computed, e.g.,
using dynamic programming (DP), which is challenging
to formulate and would have exponential complexity.